[MUSIC]
So now let's talk about the problem
a little bit more, and specifically let's
talk about the two different ways
of estimating the parameters.
One is called the Maximum Likelihood
estimate that I already just mentioned.
The other is Bayesian estimation.
So in maximum likelihood estimation,
we define best as
meaning the data likelihood
has reached the maximum.
So formally it's given
by this expression here,
where we define the estimate as the arg
max of the probability of X given theta.
So, arg max here just means it's
actually a function that will return
the argument that gives the function
its maximum value, as the value.
So the value of arg max is not
the value of this function.
But rather, the argument that
makes the function reach its maximum.
So in this case the value
of arg max is theta.
It's the theta that makes the probability
of X, given theta, reach its maximum.
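The difference between max and arg max can be sketched in a few lines of Python. This is only an illustration: the coin-flip data (7 heads in 10 flips), the coarse grid of candidate theta values, and the names `likelihood`, `max_value`, and `arg_max` are assumptions made for the example and do not come from the lecture.

```python
# Hypothetical data: 7 heads observed in 10 coin flips.
heads, flips = 7, 10

def likelihood(theta):
    """p(X | theta) for one fixed sequence of flips containing `heads` heads."""
    return theta ** heads * (1 - theta) ** (flips - heads)

# Candidate theta values on a coarse grid: 0.0, 0.1, ..., 1.0.
thetas = [i / 10 for i in range(11)]

# max returns the largest value the function takes on the grid.
max_value = max(likelihood(t) for t in thetas)

# arg max returns the theta at which that largest value is reached.
arg_max = max(thetas, key=likelihood)

print("max of the likelihood:", max_value)
print("arg max (the estimate of theta):", arg_max)
```

Here the maximum likelihood estimate is `arg_max`, a value of theta, while `max_value` is a probability and is not the estimate.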
So this estimate intuitively also
makes sense, and it's often very useful,
and it seeks the parameters
that best explain the data.
But it has a problem when the data
is too small, because when there
are very few data points,
the sample is small,
and then if we trust the data entirely
and try to fit the data,
we'll be biased.
So in the case of text data,
let's say our observed 100
words did not contain another
word related to text mining.
Now, our maximum likelihood estimator
will give that word a zero probability.
Because giving it a non-zero probability
would take away probability
mass from some observed word,
which obviously is not optimal in
terms of maximizing the likelihood of
the observed data.
But this zero probability for
all the unseen words may not
be reasonable sometimes.
Especially, if we want the distribution
to characterize the topic of text mining.
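A minimal sketch of this zero-probability problem follows, assuming a unigram model estimated by relative frequency. The seven-word sample, the unseen word "clustering", and the helper `mle_prob` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical tiny sample standing in for the observed words.
observed = "text mining text data mining algorithm text".split()

counts = Counter(observed)
total = sum(counts.values())

def mle_prob(word):
    """Maximum likelihood estimate of p(word): its relative frequency in the sample."""
    return counts[word] / total  # Counter returns 0 for words never seen

print(mle_prob("text"))        # a seen word gets non-zero probability
print(mle_prob("clustering"))  # an unseen but topically related word gets zero
```

All of the probability mass is spent on the observed words, so any word missing from the sample is treated as impossible, however relevant it is to the topic.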
So one way to address this problem is
actually to use Bayesian estimation,
where we actually would look
at the both the data, and
our prior knowledge about the parameters.
We assume that we have some prior
belief about the parameters.
Now in this case of course, so we are not
going to look at just the data,
but also look at the prior.
So the prior here is
defined by P of theta, and
this means we will impose some
preference for certain thetas over others.
And by using Bayes Rule,
that I have shown here,
we can then combine
the likelihood function
with the prior to give us this
posterior probability of the parameter.
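The formula shown on the slide is not part of this transcript. For reference, here is a standard statement of Bayes rule in the lecture's symbols (theta for the parameters, X for the data). It is a reconstruction and may not match the slide exactly.

```latex
p(\theta \mid X) \;=\; \frac{p(X \mid \theta)\, p(\theta)}{p(X)}
\;\propto\; p(X \mid \theta)\, p(\theta)
```

The denominator p(X) does not depend on theta, which is why the posterior is proportional to the likelihood times the prior.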
Now, a full explanation of Bayes rule,
and some of these things
related to Bayesian reasoning,
would be outside the scope of this course.
But I just gave a brief
introduction because this is
general knowledge that
might be useful to you.
The Bayes Rule is basically defined here,
and
allows us to write down one
conditional probability of X
given Y in terms of the conditional
probability of Y given X.
And you can see the two probabilities
are different in the order
of the two variables.
But often the rule is used for
making inferences
of the variable, so
let's take a look at it again.
We can assume that p(X) encodes
our prior belief about X.
That means before we observe any other
data, that's our belief about X,
whether we believe some X values have
higher probability than others.
And this probability of X given Y
is a conditional probability, and
this is our posterior belief about X.
Because this is our belief about X
values after we have observed the Y.
Given that we have observed the Y,
now what do we believe about X?
Now, do we believe some values have
higher probabilities than others?
Now the two probabilities
are related through this one,
this can be regarded as the probability of
the observed evidence Y,
given a particular X.
So you can think about X
as our hypothesis, and
we have some prior belief about
which hypothesis to choose.
And after we have observed Y,
we will update our belief and
this updating formula is based
on the combination of our prior
and the likelihood of observing
this Y if X is indeed true.
So much for the detour about Bayes Rule.
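The belief update just described can be sketched with a small discrete example. Everything in it is assumed for illustration: the two hypotheses, the prior values, and the likelihoods of the observed word are made-up numbers, not figures from the lecture.

```python
# Hypothetical setup: X is the topic (the hypothesis), Y is an observed word (the evidence).
prior = {"text mining": 0.2, "sports": 0.8}          # p(X), belief before seeing Y
likelihood = {"text mining": 0.05, "sports": 0.001}  # p(Y = "clustering" | X)

# p(Y): total probability of the evidence, summed over all hypotheses.
evidence = sum(prior[x] * likelihood[x] for x in prior)

# Bayes rule: p(X | Y) = p(Y | X) * p(X) / p(Y)
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}

for x in prior:
    print(f"{x}: prior={prior[x]:.3f} posterior={posterior[x]:.3f}")
```

With these numbers the prior favors "sports". The observed word is much more likely under "text mining", so the posterior shifts strongly toward that hypothesis.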
In our case, what we are interested
in is inferring the theta values.
So, we have a prior here that includes
our prior knowledge about the parameters.
And then we have the data likelihood here,
that would tell us which parameter
value can explain the data well.
The posterior probability
combines both of them,
so it represents a compromise
of the two preferences.
And in such a case, we can maximize
this posterior probability
to find the theta that would
maximize this posterior probability,
and this estimator is called the Maximum
a Posteriori, or MAP, estimator.
And this estimator is
a more general estimator than
the maximum likelihood estimator.
Because if we define our prior
as a noninformative prior,
meaning that it's uniform
over all the theta values.
No preference, then we basically would go
back to the maximum likelihood estimator.
Because in such a case,
it's mainly going to be determined by
this likelihood value, the same as here.
But if we have some informative prior,
some bias towards
certain values, then the MAP estimator
can allow us to incorporate that.
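The relationship between the two estimators can be checked numerically. The sketch below assumes a coin-flip model, a very small sample (3 heads in 3 flips), and a grid search over theta. The function names and the particular informative prior are hypothetical choices for the example.

```python
# Hypothetical data: 3 heads in 3 flips, a very small sample.
heads, flips = 3, 3

# Grid of candidate theta values: 0.00, 0.01, ..., 1.00.
thetas = [i / 100 for i in range(101)]

def likelihood(theta):
    """p(X | theta) for the observed flips."""
    return theta ** heads * (1 - theta) ** (flips - heads)

def uniform_prior(theta):
    """Noninformative prior: no preference among theta values."""
    return 1.0

def informative_prior(theta):
    """Assumed prior belief that the coin is roughly fair (unnormalized, peaks at 0.5)."""
    return theta ** 4 * (1 - theta) ** 4

def map_estimate(prior):
    # Maximize likelihood times prior; dividing by p(X) would not change the arg max.
    return max(thetas, key=lambda t: likelihood(t) * prior(t))

mle = max(thetas, key=likelihood)
print("MLE:", mle)
print("MAP with uniform prior:", map_estimate(uniform_prior))
print("MAP with informative prior:", map_estimate(informative_prior))
```

With the uniform prior the MAP estimate coincides with the maximum likelihood estimate. With the informative prior it is pulled away from the extreme value toward 0.5.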
But the problem here of course,
is how to define the prior.
There is no free lunch and if you want to
solve the problem with more knowledge,
we have to have that knowledge.
And that knowledge,
ideally, should be reliable.
Otherwise, your estimate may not
necessarily be more accurate than the
maximum likelihood estimate.
So, now let's look at the Bayesian
estimation in more detail.
So, I show the theta values as just a
one-dimensional value, and
that's a simplification of course.
And so, we're interested in which
value of theta is optimal.
So now, first we have the Prior.
The Prior tells us that
some of the values
are more likely than others, we believe.
For example, these values are more
likely than the values over here,
or here, or other places.
So this is our Prior, and
then we have our data likelihood.
And in this case, the data also tells us
which values of theta are more likely.
And that just means those values
can best explain our data.
And then when we combine the two
we get the posterior distribution,
and that's just a compromise of the two.
It would say that it's
somewhere in-between.
So, we can now look at some
interesting point estimates.
This point represents the mode of prior,
that means the most likely parameter
value according to our prior,
before we observe any data.
This point is the maximum
likelihood estimator,
it represents the theta that gives
the data the maximum probability.
Now this point is interesting,
it's the posterior mode.
It's the most likely value of theta
given by the posterior distribution.
And it represents a good
compromise of the prior mode and
the maximum likelihood estimate.
Now in general in Bayesian inference,
we are interested in
the distribution of all these
parameter values, as you see here.
There's a distribution over
theta values that you can see
here, P of theta given X.
So the problem of Bayesian inference is
to infer this posterior distribution, and
also to infer other interesting
quantities that might depend on theta.
So, I show f of theta here
as an interesting variable
that we want to compute.
But in order to compute this value,
we need to know the value of theta.
In Bayesian inference,
we treat theta as an uncertain variable.
So we think about all
the possible values of theta.
Therefore, we can estimate the value of
this function f as the expected value of f,
according to the posterior distribution
of theta, given the observed evidence X.
As a special case, we can assume f
of theta is just equal to theta.
In this case,
we get the expected value of the theta,
that's basically the posterior mean.
That gives us also one point estimate of theta, and
it's sometimes the same as posterior mode,
but it's not always the same.
So, it gives us another way
to estimate the parameter.
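The expected value under the posterior can be approximated by a weighted sum over a grid of theta values. This sketch assumes a coin-flip model with 2 heads in 10 flips and a uniform prior. The data, the grid size, and the helper `expected` are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical data: 2 heads in 10 flips, combined with a uniform prior over theta.
heads, flips = 2, 10

# Midpoints of 1000 equal-width bins covering the interval (0, 1).
n = 1000
thetas = [(i + 0.5) / n for i in range(n)]

# Unnormalized posterior on the grid: likelihood times the (constant) prior.
weights = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]
total = sum(weights)
posterior = [w / total for w in weights]  # normalized so the grid weights sum to 1

def expected(f):
    """Posterior expectation of f(theta), approximated by a sum over the grid."""
    return sum(f(t) * p for t, p in zip(thetas, posterior))

posterior_mean = expected(lambda t: t)  # the special case f(theta) = theta
posterior_mode = thetas[posterior.index(max(posterior))]

print("posterior mean:", round(posterior_mean, 3))
print("posterior mode:", round(posterior_mode, 3))
```

Under these assumptions the posterior is skewed, so the mean (close to 0.25) and the mode (close to 0.2) differ. Passing another function to `expected` would estimate any other quantity that depends on theta.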
So, this is a general illustration of
Bayesian estimation and inference.
And later,
you will see this can be useful for
topic mining where we want to inject
some prior knowledge about the topics.
So to summarize,
we've introduced the language model,
which is basically a probability
distribution over text.
It's also called a generative model for
text data.
The simplest language model
is the Unigram Language Model,
it's basically a word distribution.
We introduced the concept
of likelihood function,
which is the probability of
the data given some model.
And this function is very important,
given a particular set of parameter
values this function can tell us which X,
which data point has a higher likelihood,
higher probability.
Given a data sample X,
we can use this function to determine
which parameter values would maximize
the probability of the observed data,
and this is the maximum
likelihood estimate.
We also talked about Bayesian
estimation or inference.
In this case, we must define a prior
on the parameters, p of theta.
And then we're interested in computing the
posterior distribution of the parameters,
which is proportional to the prior and
the likelihood.
And this distribution would allow us then
to infer any values derived from theta.
[MUSIC]

